Background / Context:

Recent policy has charged schools and districts with maintaining highly qualified 
teachers and differentiating among teachers in terms of their effectiveness (U.S. Department of 
Education, 2009). This emphasis has driven the development and implementation of teacher 
quality measures which are increasingly being used to evaluate teachers with important 
consequences (Schochet & Chiang, 2010). One increasingly common component of these 
evaluations is the direct observation of teachers in their classrooms. Classroom observations 
have long been viewed as a promising way to evaluate and develop teachers because they anchor 
assessments in specific and observable criteria (Gitomer, 2009). 

Despite the potential of classroom observations to identify strengths and address specific 
weaknesses in teachers’ practices (MET, n.d.), the systems used to conduct classroom 
observations tend to be influenced by aspects of the observational environment beyond 
teaching quality (Kennedy, 2010). For instance, many observation systems evaluate teaching 
using fallible indicators and raters and generally draw inferences using only a small sample of 
teachers' lessons. Such features potentially introduce bias and imprecision into teaching quality 
assessments. Because these measures often form the basis for high-stakes decisions, the 
robustness of teaching quality scores to these features has taken on increasing importance. 
Yet despite this importance, relatively little is known about the accuracy 
and precision of these scores amidst construct-irrelevant variation and the extent to which 
different treatments of this variation arrive at similar scores and precision levels. 

Two approaches to scoring classroom observations of teaching quality have largely 
dominated this literature: classical test theory (CTT) and generalizability theory (GT). In this 
proposal, we developed a third alternative approach based on item response theory (IRT). Each 
approach incorporates measurement error into its framework, but each does so in a distinctly 
different way. We briefly outline the features of these approaches as they apply to 
treatments of construct-irrelevant sources of variation. 

Both CTT and GT construct estimates of teaching quality by summing or averaging 
across all items and observations. As a result, CTT/GT assumes that the original (ordinal) 
teaching ratings all hold equal amounts of information and are continuously scaled such that 
scores are created by averaging over any construct-irrelevant variation (e.g., raters). CTT then 
estimates a single level of reliability and measurement error for each teacher by simply 
decomposing observed variance into true and error score variance. GT refines this approach by 
further decomposing the error variance among sources (e.g., raters, occasions). GT then 
estimates a single level of reliability and precision for all indices by assuming that units within 
each source of variation (both current and future) are exchangeable (e.g., teachers or raters are 
exchangeable). As a result, although CTT/GT acknowledges the influence of construct-irrelevant 
variance on the scores, their reliability and their precision, the reported scores make no 
adjustments for these errors and allow construct-irrelevant variance to accumulate as 
measurement error. 

In contrast to these methods, an IRT based approach does not assume raw ratings are 
continuously scaled and hold equal information and instead estimates the latent trait theoretically 
underlying teachers' observed rating patterns by postulating a probabilistic response model. 
Similar to GT, our extension of the common IRT model to classroom observations recognizes 
the influence of multiple sources of construct-irrelevant variation. However, in contrast to GT, 
our IRT based approach adjusts for construct-irrelevant variation to provide a measure of 
teaching quality that is as independent as possible of the sources of construct-irrelevant variation. 

Purpose / Objective / Research Question / Focus of Study:

In this proposal, we investigated the robustness of classroom observation scores to three 
measurement approaches: CTT, GT, and IRT. We investigated the extent to which choices 
among these approaches lead to indeterminacies in conclusions regarding teaching quality, the 
precision with which we can index this quality, and the relation of this quality to student 
achievement. We then provide insights delineating the reasons for the discrepancies among 
methods and explore the extent to which the observed differences are indicative of true 
differences among teachers in their teaching quality. 

Setting & Population / Participants / Subjects & Intervention / Program / Practice: 

This study takes place within the larger Developing Measures of Effective Mathematics 
Teaching study, which focused on identifying practices and characteristics that 
distinguish between more and less effective teachers. The sample includes 250 teachers from 40 
schools and 4 districts. Table 1 presents a few basic descriptive statistics. Our study drew on the 
Mathematical Quality of Instruction (MQI) classroom observation system (Hill et al., 2008; 
Hill, Charalambous, & Kraft, 2012). The MQI system was designed to provide assessments for 
teachers on important dimensions of classroom mathematics instruction. The structure of this 
system was developed to provide a multidimensional and balanced view of mathematics 
instruction (Hill et al., 2008). In the current investigation we studied each of the system's 
dimensions but present only the general dimension for proposal brevity. 


Statistical, Measurement, or Econometric Model: 

To index teaching quality using CTT, we averaged ratings across items, chapters, and 
raters, thus collapsing across all sources of construct-irrelevant variation. Let 

$$\bar{Y}_t = \frac{1}{C_t} \sum_{c=1}^{C_t} \frac{1}{R_c} \sum_{r=1}^{R_c} \frac{1}{I} \sum_{i=1}^{I} Y_{ictr} \qquad (1)$$

where $\bar{Y}_t$ is the estimate of teaching quality for teacher t across all items, chapters, and raters, 
$Y_{ictr}$ is the score for item i in chapter c for teacher t given by rater r, $R_c$ is the total number of 
raters for chapter c, $C_t$ is the number of chapters observed for teacher t, and $I$ is the total number 
of items. To describe the uncertainty associated with CTT scores of teaching quality, we used 
CTT's concept of the standard error of measurement. Specifically, define coefficient alpha as 


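$$\alpha = \frac{I}{I-1}\left(1 - \frac{\sum_{i=1}^{I}\sigma_i^2}{\sigma_X^2}\right) \qquad (2)$$

where $I$ is the number of items, $\sigma_i^2$ is the variance of item i across teachers and $\sigma_X^2$ is the 
variance of the observed total scores. Standard errors were obtained using 

$$\text{Standard Error of Measurement (SEM)} = \sigma_X\sqrt{1-\alpha} \qquad (3)$$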


Confidence intervals were formed using each teacher's score plus or minus double the SEM. 
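
As a minimal sketch of how coefficient alpha in equation (2) might be computed, assume a complete 
teachers-by-items matrix of ratings that has already been averaged over chapters and raters. The 
function name, the toy ratings, and the use of NumPy are illustrative assumptions and not the 
study's own code or data. 

```python
import numpy as np


def coefficient_alpha(item_scores):
    """Coefficient alpha for a teachers-by-items score matrix.

    item_scores: 2-D array-like, one row per teacher, one column per item.
    """
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    # Variance of each item across teachers (sample variance).
    item_variances = x.var(axis=0, ddof=1)
    # Variance of the observed total scores across teachers.
    total_variance = x.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)


# Hypothetical ratings for five teachers on three items (not study data).
toy_ratings = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [1, 1, 2],
    [2, 3, 2],
]
toy_alpha = coefficient_alpha(toy_ratings)
print(f"alpha = {toy_alpha:.3f}")
```

When every item ranks teachers identically, the item variances account for exactly 1/I of the total 
variance and alpha equals one; weaker agreement among items pulls alpha toward zero. 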

To describe teaching quality using GT we also scored teachers using equation (1). 
Subsequently, we constructed standard errors using the SEM (3), replacing $\alpha$ with 

$$\rho = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_r^2/R + \sigma_c^2/C} \qquad (4)$$

where $\sigma_t^2$ is the teacher variability, $\sigma_r^2$ is rater variability with R (= 2) as the number of 
raters per observation, and $\sigma_c^2$ is the chapter variability with C as the average number of observed 
chapters per teacher. Teacher variation ($\sigma_t^2$) stems from stable differences among teachers in 
their quality; rater variation ($\sigma_r^2$) arises from differences among observers in 
their severity; chapter variation ($\sigma_c^2$) manifests when teachers' quality varies across chapters. Our 
partially crossed design largely precludes estimation of the remaining interactions among these 
components; however, we also considered the rater-by-teacher interaction. 
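
The following sketch illustrates how a GT-style reliability could be assembled from variance 
components, assuming the coefficient takes the form teacher variance divided by teacher variance 
plus rater variance over R plus chapter variance over C. The variance proportions are those 
reported in Table 2; the value of 19 chapters per teacher is a hypothetical figure chosen only for 
illustration, since the average number of observed chapters is not reported here. 

```python
def gt_reliability(var_teacher, var_rater, var_chapter, n_raters, n_chapters):
    """Reliability of a teacher's average score under an assumed GT decomposition.

    Error variance shrinks as ratings are averaged over more raters and chapters.
    """
    error_variance = var_rater / n_raters + var_chapter / n_chapters
    return var_teacher / (var_teacher + error_variance)


# Variance proportions from Table 2; n_chapters=19 is an illustrative assumption.
reliability = gt_reliability(0.11, 0.06, 0.83, n_raters=2, n_chapters=19)
print(f"reliability of the average = {reliability:.2f}")

# A single chapter scored by a single rater recovers the raw teacher share.
single_rating = gt_reliability(0.11, 0.06, 0.83, n_raters=1, n_chapters=1)
print(f"reliability of one rating  = {single_rating:.2f}")
```

Under these assumptions the reliability of the average lands close to the 0.60 reported for GT, 
which shows how heavily that figure depends on averaging over many chapters. 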

To describe teaching using IRT, we developed a multilevel graded response model 

$$P(Y_{ictr} = k \mid \theta) = P(Y_{ictr} \ge k \mid \theta) - P(Y_{ictr} \ge k+1 \mid \theta) \qquad (5)$$

Here $Y_{ictr}$ is the score for item i in chapter c for teacher t rated by rater r, $a_i$ represents the 
discrimination parameter for item i, $\theta_t$ represents teacher t's stable level of teaching quality and is 
assumed to be normally distributed, $\gamma_r$ is a fixed effect for rater r's level of leniency and $\alpha_{ctr}$ is the 
deviation specific to chapter c for teacher t rated by rater r with assumed normal distribution 
$N(0, \sigma^2)$. Further, let K represent the number of categories on which items are graded (three with MQI), 
with k as a specific category, and let $d_i^{(1)}, \ldots, d_i^{(K-1)}$ be a set of K - 1 ordered item difficulty 
intercepts. To identify the scale, we set $\theta$ to have a normal distribution with mean zero and unit 
variance. Estimation was carried out using maximum likelihood with corresponding quality 
levels estimated using an expected a posteriori approach. Finally, the standard errors of scores 
were estimated using the posterior standard deviations obtained from the second derivative of the 
log-likelihood function. Symmetric confidence intervals were formed using each teacher's score 
plus or minus double the posterior standard deviation. 
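
The cumulative probabilities that enter the model are not spelled out above. One common 
cumulative-logit form that is consistent with the parameter definitions, assumed here purely for 
illustration, sets the log-odds of scoring above category k to the item discrimination times the 
sum of the teacher, rater, and chapter terms, minus the k-th intercept. The sketch below computes 
category probabilities for a single item under that assumption; the function name and all 
parameter values are hypothetical. 

```python
import numpy as np


def graded_response_probs(theta, gamma, alpha, a, d):
    """Category probabilities for one item under an assumed cumulative-logit form.

    theta: teacher quality; gamma: rater leniency; alpha: chapter deviation;
    a: item discrimination; d: K-1 ordered intercepts (ascending).
    Returns an array of K probabilities for categories 1..K.
    """
    d = np.asarray(d, dtype=float)
    # Assumed linear predictor for P(Y >= k+1), k = 1..K-1.
    eta = a * (theta + gamma + alpha) - d
    upper = 1.0 / (1.0 + np.exp(-eta))
    # P(Y >= 1) = 1 and P(Y >= K+1) = 0 bracket the cumulative probabilities.
    cumulative = np.concatenate(([1.0], upper, [0.0]))
    # P(Y = k) = P(Y >= k) - P(Y >= k+1), as in the graded response model.
    return cumulative[:-1] - cumulative[1:]


# Three categories, as with MQI; all values below are made up.
neutral = graded_response_probs(theta=0.0, gamma=0.0, alpha=0.0, a=1.0, d=[-1.0, 1.0])
lenient = graded_response_probs(theta=0.0, gamma=0.5, alpha=0.0, a=1.0, d=[-1.0, 1.0])
print(neutral.round(3), lenient.round(3))
```

In this parameterization a more lenient rater shifts probability toward the top category for the 
same teacher, which is the kind of construct-irrelevant shift the rater effect is meant to absorb. 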

Research Design: 

Using 40 raters, the study observed 250 teachers across three time points. Observations 
were broken into several chapters of about seven minutes in length and two raters provided 
ordinal ratings (scores of 1, 2, or 3) of teachers' instruction along 21 different items/indicators. 
Below we describe the results for a single dimension, the general or overall quality of teaching. 

Findings / Results: 

Overall there were significant discrepancies among results suggesting that estimates of 
teaching quality and their precision were sensitive to scoring approach. We very briefly highlight 
only a few findings. Indices constructed using CTT indicated the (alpha) reliability of teachers' 
scores was on the order of 0.48 whereas GT results suggested that the reliability of the average 
was 0.60 (Table 2). Our application of IRT both acknowledged and adjusted for the presence of 
construct-irrelevant variance, and as a result, it recorded higher levels of reliability despite the 
presence of a substantial amount of construct-irrelevant variance. Because information/reliability 
is a function of the latent trait in IRT, we described the reliability of IRT scores by presenting the 
average reliability for teachers at each level of the continuum (Figure 1). Overall, the average 
reliability of IRT scores fluctuated between 0.86 for low quality teachers and 0.93 for high 
quality teachers, although for specific teachers the reliabilities ranged from 0.72 to 0.95. 

In assessing the extent to which the methods agreed on the actual values of teachers' 
scores, we observed a correlation between CTT/GT scores and IRT scores of 0.82. A scatter plot 
of the scores indicated that the standardized CTT/GT scores generally had a wider range than the 
IRT scores (Figure 2). CTT/GT scores ranged from four SDs below the mean to three SDs above 
the mean whereas IRT scores were shrunken toward the mean with a range between negative 
three SDs and positive two SDs. 

Corresponding to the aforementioned discrepant reliabilities, we also found substantial 
differences in the precisions with which the methods could index teaching quality scores (Figure 
3). In particular, because CTT does not discriminate among sources of variation with regard to 
indexing reliability and GT acknowledges them, GT standard errors of scores tended to be 
smaller (0.72 for CTT and 0.63 for GT). The IRT adjustments for construct-irrelevant variation 
further reduced the size of standard errors for IRT scores by about half. Our results generally 
indicated that the 95% confidence intervals for IRT-based scores were substantially 
narrower than their CTT and GT counterparts (Figure 4). For instance, for an average teacher the 
confidence interval for IRT-based scores spanned 1.2 standard deviations (e.g., 
0 ± 0.6), whereas the confidence intervals for CTT/GT averages spanned 2.52 standard 
deviations for GT (e.g., 0 ± 1.26) and 2.88 standard deviations for CTT (e.g., 0 ± 1.44). 
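
The reported standard errors and interval widths can be checked against the reported reliabilities 
with a few lines of arithmetic. The sketch below assumes scores standardized to unit variance, so 
that the SEM reduces to the square root of one minus the reliability, and uses the plus or minus 
double the SEM rule described earlier; the function names are illustrative. 

```python
import math


def sem_from_reliability(reliability, sd=1.0):
    """Standard error of measurement for scores with standard deviation sd."""
    return sd * math.sqrt(1.0 - reliability)


def interval_width(sem, multiplier=2.0):
    """Full width of a symmetric interval of score +/- multiplier * SEM."""
    return 2.0 * multiplier * sem


# Reliabilities reported in the text for the two averaging-based methods.
for label, reliability in [("CTT", 0.48), ("GT", 0.60)]:
    sem = round(sem_from_reliability(reliability), 2)
    width = round(interval_width(sem), 2)
    print(f"{label}: SEM = {sem:.2f}, interval width = {width:.2f}")
```

Under these assumptions the computed values match the standard errors of 0.72 and 0.63 and the 
interval widths of 2.88 and 2.52 given above. 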

Despite the potential for item response models to improve the accuracy and precision of 
teacher scores, there is some question as to the validity of the adjustments made by IRT. Among 
other important assumptions (e.g., unidimensionality, invariance), our IRT approach assumes 
that the model based adjustments we made for construct-irrelevant variance are valid and 
accurate. Two sources of construct-irrelevant variation that our IRT model adjusted for were 
differences among rater severities and atypical chapters. To examine the validity of these 
adjustments, we correlated teacher value-added scores with CTT/GT and IRT scores with and 
without these adjustments. 

Our results suggested that our adjustments were of mixed value (Table 3). Use of an IRT 
model that did not adjust for construct-irrelevant variance (i.e., $\gamma_r = 0$ and $\alpha_{ctr} = 0$ in equation (5)) 
shared nearly the same relation with value-added scores as did CTT/GT scores (0.16). 
Adjustments for atypical chapters but not raters (i.e., $\gamma_r = 0$ in equation (5)) improved the IRT score 
correlation by 25% to 0.20 and pushed it under the nominal p-value cutoff of 0.05. In contrast, 
similar adjustments for raters diminished this relationship to 0.14 suggesting our simple 
adjustments for rater severities might be insufficient in describing the complex variation among 
raters. 

Conclusions: 

Overall the results suggest that construct-irrelevant variance is sizeable in classroom 
observations and that treatment of this variance had significant implications for the resulting 
scores. Although the authority of correlating teaching observation scores with value-added scores 
in validating the appropriateness of each method is unclear, the results suggested that there is 
much to be gained from methods which directly address construct-irrelevant variation. 
Specifically, our results suggested that we might be able to create more reliable, more precise 
and more differentiated indices of teaching quality as they relate to students' achievement by 
estimating the impact of different sources of construct-irrelevant variance. At the same time, our 
empirical application also highlighted the potential for erroneous adjustments. However, because 
classroom observations are potentially attached to high stakes decisions, ignoring construct- 
irrelevant variation does not seem like a viable option. More specifically, because the reliabilities 
of the averages under CTT and GT are so low and the confidence intervals they produce are so 
wide, it seems unlikely that decision makers will be willing to make evaluations amidst so much 
uncertainty. To this end, our results suggest that empirically based adjustments for construct- 
irrelevant variance are a promising, albeit complex, approach to understanding teaching quality 
through classroom observations. 

Appendix B. Tables and Figures


Table 1: Selected descriptive statistics of teachers 

Teacher Variable                          Mean/Percent
Number of mathematics courses taken
  None                                    3.50
  One                                     4.70
  Two                                     9.90
  Three or more                           80.8
White                                     0.66
Years of experience                       9.86
Majored/minored in mathematics            0.06

Table 2: Proportion of variance attributable to source 

Source      Variance
Teacher     0.11
Raters      0.06
Chapter     0.83


SREE Spring 2013 Conference Abstract Template 





Table 3: Standardized regression coefficient of teachers' observation scores predicting their 
value-added scores 

                                                  Standardized coefficient
                                                  (standard error)            t-value
CTT/GT                                            0.15  (0.10)                1.65
IRT without chapter or rater adjustments          0.16  (0.10)                1.65
IRT with chapter but without rater adjustments    0.20* (0.10)                2.03
IRT with chapter and rater adjustments            0.14  (0.10)                1.44

*p < 0.05 

Figure 1: Reliability of scores by method (axis label: Average Reliability) 

Figure 2: Scatter plot of IRT and CTT/GT scores (axis label: Averages) 

Figure 3: Size of standard error by method (axis label: Magnitude of Standard Error) 

Figure 4: Plot of teachers' scores surrounded by 95% confidence intervals by scoring method 
(axis labels: Teacher, Teacher Scores) 

Note: 
Orange dots: Scores based on CTT 
Blue bands: Confidence intervals based on CTT reliability 
Green bands: Confidence intervals based on GT reliability 
Black dots: Scores based on IRT 
Red: Confidence intervals based on IRT 



